Word n-grams for cluster keyboards

Authors

  • Nils Klarlund
  • Michael Riley
Abstract

A cluster keyboard partitions the letters of the alphabet onto subset keys. On such keyboards, most words are typed with no more key presses than on a standard keyboard, but a key sequence may stand for two or more words. In current practice, this ambiguity problem is addressed by hypothesizing words according to their unigram (occurrence) frequency. When the hypothesized word is not the intended one, an error arises. In this paper, we study the effect of deploying the large n-gram language models used in speech recognition on reducing the error rate. We use the North American Business News (NAB) corpus, which contains hundreds of millions of words. We report results for the telephone keypad and for cluster keyboards with 5, 8, 10, and 14 keys based on the QWERTY layout. Despite our assumption that a word hypothesis must be displayed promptly, we show that the error rate can be reduced to as little as one-fourth of that of the unigram method.
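The Python sketch below (not the paper's implementation) illustrates the ambiguity problem and the disambiguation idea on the telephone keypad: several words share one key sequence, and the hypothesized word can be chosen from unigram counts alone or, as studied in the paper, with help from the preceding word via bigram counts. The keypad mapping is standard; the vocabulary and counts are made-up illustration data, not NAB statistics.

```python
# Illustrative sketch only: toy counts, not the paper's NAB language models.
TELEPHONE_KEYS = {
    '2': "abc", '3': "def", '4': "ghi", '5': "jkl",
    '6': "mno", '7': "pqrs", '8': "tuv", '9': "wxyz",
}
LETTER_TO_KEY = {c: k for k, letters in TELEPHONE_KEYS.items() for c in letters}

def key_sequence(word):
    """Digit sequence a user presses to type `word` on the keypad."""
    return "".join(LETTER_TO_KEY[c] for c in word)

# Hypothetical corpus counts (illustration only).
UNIGRAM = {"good": 1500, "home": 1200, "gone": 300, "hood": 50}
BIGRAM = {("go", "home"): 80, ("go", "good"): 3, ("go", "gone"): 2}

def candidates(keys):
    """All vocabulary words whose key sequence matches what was typed."""
    return [w for w in UNIGRAM if key_sequence(w) == keys]

def hypothesize_unigram(keys):
    """Current practice: pick the most frequent matching word."""
    return max(candidates(keys), key=UNIGRAM.get)

def hypothesize_bigram(prev_word, keys):
    """Use the preceding word: rank by bigram count, break ties by unigram count."""
    return max(candidates(keys),
               key=lambda w: (BIGRAM.get((prev_word, w), 0), UNIGRAM[w]))

print(candidates("4663"))                # ['good', 'home', 'gone', 'hood']
print(hypothesize_unigram("4663"))       # 'good'
print(hypothesize_bigram("go", "4663"))  # 'home'
```

With these toy counts the unigram and bigram hypotheses diverge on the same key sequence, which is exactly the kind of error the n-gram models are meant to reduce.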

Similar articles

Exploring Word Embeddings and Character N-Grams for Author Clustering

We presented our system for the PAN 2016 Author Clustering task. Our software used simple character n-grams to represent the document collection. We then ran K-Means clustering optimized using the Silhouette Coefficient. Our system yields competitive results and requires only a short runtime. Character n-grams can capture a wide range of information, making them effective for authorship attribution...
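The following Python sketch is one plausible reading of that pipeline (character n-gram features, K-Means, cluster-count selection by silhouette score) using scikit-learn; the documents, n-gram range, and parameter choices are illustrative assumptions, not the PAN 2016 submission.

```python
# Hedged sketch of char-n-gram + K-Means + silhouette selection (toy data).
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

docs = [
    "the quick brown fox jumps over the lazy dog",
    "a fast auburn fox leapt over a sleepy hound",
    "gradient descent minimizes a differentiable loss",
    "stochastic gradient methods optimize neural networks",
]

# Character 3-grams (within word boundaries) as document features.
vectorizer = TfidfVectorizer(analyzer="char_wb", ngram_range=(3, 3))
X = vectorizer.fit_transform(docs).toarray()

# Pick the number of clusters that maximizes the silhouette coefficient.
best_k, best_score, best_labels = None, -1.0, None
for k in range(2, len(docs)):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    score = silhouette_score(X, labels)
    if score > best_score:
        best_k, best_score, best_labels = k, score, labels

print(best_k, best_labels)
```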

Multi-Class Composite N-gram Language Model for Spoken Language Processing Using Multiple Word Clusters

In this paper, a new language model, the Multi-Class Composite N-gram, is proposed to avoid the data sparseness problem of spoken language, for which it is difficult to collect training data. The Multi-Class Composite N-gram maintains accurate word prediction capability and reliability for sparse data with a compact model size based on multiple word clusters, called Multi-Classes. In the Multi-Cl...

Multi-class composite n-gram language model using multiple word clusters and word successions

In this paper, a new language model, the Multi-Class Composite N-gram, is proposed to avoid the data sparseness problem that arises with small amounts of training data. The Multi-Class Composite N-gram maintains accurate word prediction capability and reliability for sparse data with a compact model size based on multiple word clusters, so-called Multi-Classes. In the Multi-Class, the statistical connectivity...

Identification of Case, Digits and Special Symbols Using a Context Window

We present strategies and results for identifying the symbol type of every character in a text document. Assuming reasonable word and character segmentation for shape clustering, we designed several type recognition methods that depend on cluster n-grams, characteristics of neighbors, and within-word context. On an ASCII test corpus of 925 articles, these methods represent a substantial improve...

Multi-class composite N-gram based on connection direction

A new word-clustering technique is proposed to efficiently build statistically salient class 2-grams from language corpora. By splitting word-neighboring characteristics into word-preceding and word-following directions, multiple (two-dimensional) word classes are assigned to each word. On each side, word classes are merged into larger clusters independently according to preceding or following word ...


Journal title:

Volume, Issue:

Pages:

Publication year: 2003